Note: This page's design, presentation and content have been created and enhanced using Claude (Anthropic's AI assistant) to improve visual quality and educational experience.
Week 11 • Part A • Sub-Lesson 1

🔮 What the Future of AI in Research Might Look Like

A calibrated reading of what is shipping, what is overclaimed, and what is purely aspirational — with three worked cases

Week 11 is organised in two parts. Sub-lessons 11.1–11.3 (Part A) examine the future of AI in research and the institutional response to it. Sub-lessons 11.4–11.6 (Part B) turn to the parallel question of Africa's sovereign AI capacity. The two parts can be read independently — though by the end of the week the connection between them is where the Week 12 capstone lives.

🎯 What We'll Cover

Most public commentary about “the future of AI in research” is a mixture of three very different claims: things that are demonstrably happening; things that are technically real but loudly oversold; and things that are not happening at all but are presented as imminent. The most useful thing a postgraduate researcher can take away from a futures discussion is not a list of predictions but a habit: before you accept a claim about what AI is doing for science, ask which of those three buckets it falls into.

This sub-lesson does that exercise on the most prominent recent claims from the period running up to May 2026. We will look in turn at the genuinely shipping work (AlphaFold-family structure prediction; autonomous materials discovery; biomedical hypothesis systems with wet-lab validation), at one carefully-bounded case study of how a real result gets repackaged into an overclaimed one (the Sakana AI Scientist Nature paper), and at the things that are genuinely not happening yet despite confident pronouncements (end-to-end autonomous science, recursive self-improving research systems, AI as the primary author of high-impact papers).

The point is not to predict five years out — nobody can — but to give you the disposition you need to read AI-research news for the rest of your career without being captured by either hype or anti-hype. Sub-Lesson 11.2 extends the same exercise into the genuinely speculative end of the literature; 11.3 then turns to a connected question: what the institutions around you (journals, funders, peer-review) have done in response to all this, and whether their responses are working.

🧮 A Working Framework: Real / Overclaimed / Aspirational

We have been using a version of this three-bucket frame throughout Weeks 9 and 10. Week 9.3 distinguished what AI is now genuinely strong at from what it merely performs convincingly. Week 10.3 separated agentic capabilities that ship from those that exist mainly in vendor demos. The frame is the same here. The three buckets are not graded by quality — an aspirational claim is not necessarily a dishonest one — but by their relationship to evidence.

✅ Real

A capability that is documented in peer-reviewed literature or independently replicated, with the scope of the claim carefully bounded by the people who made it. Almost always narrower than the headline.

Test: would a working researcher in the field describe the result the same way the press does? If yes, it is real.

⚠️ Overclaimed

A real underlying result that has been rounded up. The paper exists, the experiment worked — but the framing, the headline, or the press release strips away the caveats that the authors themselves wrote into the discussion section.

Test: does the abstract say something importantly more careful than the press release? If yes, the press release is overclaimed.

🚀 Aspirational

A claim about the future presented as if it were the present. Often grounded in real trend lines but extrapolated past where the evidence ends. Includes timelines for AGI, “by 2027” predictions, and the recurring claim that AI scientists will replace PhDs.

Test: is the claim supported by anything other than a graph of past trends and a confident sentence? If not, treat as aspirational.

The rest of this sub-lesson is a worked example of the frame in action. We will take three concrete cases — one in each bucket — and trace each one from its primary source to its public reception.
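Read as a procedure, the three tests reduce to two yes/no questions asked in order. The following is a minimal Python sketch, assuming you answer both questions yourself after reading the primary source. The function name `classify` and its two parameters are illustrative and are not part of any published framework; the example answers reflect this sub-lesson's own reading of each claim.

```python
def classify(evidence_beyond_trends: bool, framing_matches_authors: bool) -> str:
    """Place a single claim (a headline, not a whole paper) into one bucket.

    evidence_beyond_trends: is there a paper, replication or experiment behind
        the claim, rather than only a trend line and a confident sentence?
    framing_matches_authors: does the public framing keep the caveats the
        authors themselves wrote?
    """
    # Aspirational test first: with no evidence there is nothing to round up.
    if not evidence_beyond_trends:
        return "aspirational"
    # Overclaimed test: a real result whose framing has dropped the caveats.
    if not framing_matches_authors:
        return "overclaimed"
    return "real"


# Illustrative answers only; you would fill these in from your own reading.
headlines = {
    "Structural biologists routinely use predicted structures": classify(True, True),
    "An AI wrote a paper that passed peer review": classify(True, False),
    "AI scientists will replace PhDs": classify(False, False),
}

for text, bucket in headlines.items():
    print(f"{bucket:>12}: {text}")
```

The ordering matters: a claim with no evidence behind it cannot be "overclaimed", because there is no underlying result to round up. Note also that the function classifies a claim, so the same paper can sit behind a real claim and an overclaimed one.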

✅ Real: What Is Actually Shipping in Research-AI

Four lines of work, three of them peer-reviewed and one a preprint with wet-lab data, all with humans verifying outputs in a laboratory, give the clearest publishable picture of what AI is currently contributing to active scientific research. They are not the whole picture: we return below to a second, more diffuse, and arguably larger category of AI-as-collaborator that does not yet fit inside a single headline paper. But they are the cleanest places to anchor a discussion of what is actually working.

Protein structure prediction

The AlphaFold lineage is the cleanest example of an AI capability that is now infrastructure. AlphaFold 3 (Abramson et al., Nature, May 2024) extended the protein-only prediction of AlphaFold 2 to complexes including nucleic acids, small molecules and ions; code and weights were opened to academic users in November 2024. Structural biologists no longer treat predicted structures as exotic.

Real-bucket marker: a working scientist in the field uses it without thinking about it.

Autonomous materials synthesis

The GNoME / A-Lab work (Merchant et al., Szymanski et al., Nature 2023) generated 2.2 million candidate crystal structures, filtered them to 381,000 predicted to be stable, and then ran an autonomous laboratory at Lawrence Berkeley National Laboratory that synthesised dozens of novel materials with minimal human intervention. The synthesis success rate (about 71%) is the measured number; the “381,000 stable materials” figure is a prediction.

Real-bucket marker: a closed loop from AI prediction to physical experiment, peer-reviewed.

Biomedical hypothesis-and-validation

Google's AI Co-Scientist (Gottweis et al., Nature, May 2026) is the most carefully validated recent example. It paired computational hypothesis generation with three wet-lab biomedical validations carried out by named academic collaborators. We treat this case in detail below.

Real-bucket marker: AI-generated hypotheses tested in physical experiments by humans, with the negatives reported alongside the positives.

End-to-end candidate-to-lab pipelines

FutureHouse's Robin (arXiv:2505.13400, May 2025) identified ripasudil as a candidate for dry age-related macular degeneration and ran phagocytosis assays showing a 7.5× increase in vitro. FutureHouse spun out Edison Scientific in November 2025; its successor system Kosmos reportedly processes 1,500 papers plus tens of thousands of lines of analysis code per run.

Real-bucket marker: published preprint with replicable wet-lab numbers, not just a chat transcript.

🧬 A worked example: Google's AI Co-Scientist, primary-source version

The peer-reviewed Nature paper (Gottweis et al., DOI 10.1038/s41586-026-10644-y) describes a multi-agent system built on Gemini, evaluated against OpenAI o1, o3-mini-high, DeepSeek R1, and Gemini 2.0 Pro Experimental on 203 research goals. On an 11-goal expert-evaluated subset, Co-Scientist outputs received a mean preference rank of 2.36 and ratings of 3.64/5 for novelty and 3.09/5 for impact. So far, so benchmark.

The interesting part is the three wet-lab biomedical validations published alongside:

  • Drug repurposing for acute myeloid leukaemia. Binimetinib (already approved for melanoma) showed a half-maximal inhibitory concentration as low as 2 nM in AML cell lines, against around 180 nM in a non-AML control. The genuinely novel candidate KIRA6, an IRE1α inhibitor for which no prior preclinical AML evidence existed, showed an 18-fold selectivity window between leukaemic and control cells.
  • Liver fibrosis epigenetic targets. Two of three Co-Scientist-predicted compounds, including the already-FDA-approved drug Vorinostat, showed significant anti-fibrotic activity in human hepatic organoids with no observed cellular toxicity (published as Guan et al., Advanced Science, 2025).
  • A bacterial-evolution mechanism. Co-Scientist proposed, in two days and with only minimal background information, that capsid-forming phage-inducible chromosomal islands interact with diverse phage tails to expand bacterial host range — matching the primary discovery of an independent, co-timed experimental study from the Penádés group at the Fleming Initiative / Imperial College London (Penádés et al., Cell 188(23), 6654–6665, 2025) before peer review of either was complete.
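The selectivity figures in the first bullet are ratios of half-maximal inhibitory concentrations, and it is worth doing that arithmetic yourself when reading such results. This is a minimal Python sketch, assuming the usual convention that fold selectivity is the control IC50 divided by the target-cell IC50. The helper `fold_selectivity` is illustrative, and the two inputs are the rounded values quoted above ("as low as 2 nM", "around 180 nM"), so the result is a best-case estimate rather than a number taken from the paper.

```python
def fold_selectivity(ic50_control_nm: float, ic50_target_nm: float) -> float:
    """Control IC50 divided by target-cell IC50; larger means more selective."""
    if ic50_control_nm <= 0 or ic50_target_nm <= 0:
        raise ValueError("IC50 values must be positive")
    return ic50_control_nm / ic50_target_nm


# Rounded values as quoted in the text, not raw data from the paper.
binimetinib_best_case = fold_selectivity(ic50_control_nm=180.0, ic50_target_nm=2.0)
print(f"Binimetinib, most sensitive AML line: about {binimetinib_best_case:.0f}-fold")
```

Because 2 nM is the most sensitive AML line, the roughly 90-fold figure is an upper end, not a typical value. The 18-fold window for KIRA6 is the number as reported and is not recomputed here.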

Note what made this work: AI-generated hypotheses, prioritised by oncologists and microbiologists, validated in physical experiments by named human collaborators, with the failures (drugs that didn't work, hypotheses that were rejected) reported alongside the successes.

The Co-Scientist paper's own limitations section is worth reading as a model of the calibrated voice this sub-lesson is asking you to adopt. The authors explicitly flag that the system's knowledge is constrained by open-access literature, that there is a systemic lack of access to negative experimental results, and that the validation reported is preliminary. The closing warning — that improper use of such systems without rigorous peer review and guardrails could worsen the scientific reproducibility crisis through low-quality scientific artefacts — is from the authors themselves, in Nature. That is how a major lab paper is supposed to sound. Most press coverage of the same work does not sound like that.

🧠 Beyond the headline papers: AI as a workflow-level collaborator

The four examples above are the clearest publishable demonstrations of what AI now contributes to research. They share a property that makes them legible: each one fits inside a single paper, with a single primary result. But there is a second category of “AI as collaborator” that is harder to point at in headline form and arguably bigger in aggregate effect.

Agentic coding tools — Claude Code, Codex, Cursor's agent modes — have moved over the last twelve months from completing one line of code at a time to running for hours on complex, multi-step research tasks. Used by individual researchers with strong domain judgement and the verification habits covered in Weeks 9 and 10, these tools now meaningfully support experimental code drafting, multi-stage data analysis, theoretical reasoning, debugging across substantial codebases, and large parts of manuscript construction — not just the literature search and code completion the early framing of “AI as narrow-task assistant” suggested. Practitioners report that genuinely intellectually demanding work which previously took months can sometimes be compressed into days or weeks. The active researcher community has begun to describe agentic coding tools as the single most consequential change AI has made to their day-to-day research practice.

The reason this category is harder to evaluate than the four lab-scale examples is that the productivity depends substantially on the researcher: their existing domain skill, their willingness to verify outputs at every meaningful junction, and their judgement about which parts of the work they will and will not delegate. Two researchers using identical tools can have wildly different experiences. The bottleneck for many active researchers is no longer the tool's capability but their own practice around it. For students reading this from the start of their research careers, getting the practice right is now a first-order skill.

Sources worth reading first-hand on what this looks like in practice:

  • Patrick Mineault, Claude Code for Scientists (2026). The single best practitioner write-up of how agentic coding has changed day-to-day neuroscience research workflow, from a working scientist. neuroai.science.
  • Anthropic, Coding Agents in the Social Sciences (March 2026). The baseline survey wave of an ongoing randomised study, sampling 1,260 quantitative social scientists. The headline numbers as of the baseline: 20% of surveyed researchers use coding agents regularly; 97% of those users apply them to data-analysis code; users report starting more projects than non-users at the same career stage. anthropic.com.
  • Ethan Mollick, three essays on One Useful Thing tracking the recent shift from human-AI “co-intelligence” (humans and AI prompting back-and-forth) to humans managing autonomous agents. On Working with Wizards (11 September 2025) introduces the “wizard” frame (AI systems that produce sophisticated outputs through opaque processes), with the line worth carrying: “competence and opacity rise together”. Mollick's worked example is using GPT-5 Pro to critique his own published job-market paper; the system ran its own verification code (including Monte Carlo analysis) and found a previously unnoticed error linking two tables in the paper that no human reviewer had spotted. Claude Code and What Comes Next (8 January 2026) is the most directly relevant essay for postgraduate researchers experimenting with agentic coding tools, and quotes Andrej Karpathy: “the profession is being dramatically refactored as the bits contributed by the programmer are increasingly sparse and between. I have a sense that I could be 10X more powerful if I just properly string together what has become available over the last ~year and a failure to claim the boost feels decidedly like skill issue.” The Shape of the Thing (12 March 2026) sets out the broader trajectory and the StrongDM “Software Factory” case: a three-person team running an AI-only production-software pipeline under two pre-commitments (“Code must not be written by humans” / “Code must not be reviewed by humans”), with each human engineer spending roughly $1,000 per day on AI tokens. Mollick's 2024 book Co-Intelligence: Living and Working with AI (Penguin) is the standing single-volume background.

A useful empirical complement: across recent industry-scale studies (Stack Overflow Developer Survey 2025, JetBrains AI Pulse January 2026, Google DORA 2025), controlled measurements of time savings from agentic coding tools cluster in the 13–55% range on real work tasks — not on toy benchmarks. The headline benchmark trajectory (the original SWE-bench scoring ~2% of issues in 2023, with SWE-bench Verified reaching ~88% by 2026) provides the capability ceiling; the developer surveys provide the operational reality.

🔎 What “real” looks like up close

If you take only one habit from this section, take this one: when a paper makes a striking AI-in-science claim, read the limitations section before reading the abstract. Authors usually mark the boundary of their own claim honestly. Press coverage strips that boundary out. The gap between what the discussion section says and what the headline says is one of the most useful signals you have for separating real from overclaimed.

⚠️ Overclaimed: A Detailed Case Study

Now we take a single, recent, high-profile result — the Sakana AI “AI Scientist” paper in Nature in March 2026 — and trace what it actually says, what the public coverage said it said, and what the gap between those two looks like. This is a worked exercise, not a takedown. The underlying paper is interesting and the authors are careful. The point is that even careful, peer-reviewed work gets routinely rounded up in transit.

📑 What the Sakana paper actually claims

Lu, C., Lu, C., Lange, R. T., Yamada, Y. et al. (2026). Towards end-to-end automation of AI research. Nature 651, 914–919, DOI 10.1038/s41586-026-10265-5. CC BY 4.0 open access. Note the actual title: not “The AI Scientist: Towards Fully Automated AI Research” (the lab blog framing) but the more modest Towards end-to-end automation of AI research.

The headline result, as written in the paper itself, is this: one of three AI-generated manuscripts that the team submitted to the ICLR 2025 ICBINB workshop received average reviewer scores of 6.33 on the ICLR 1–10 scale (individual scores 6, 7, 6) and was ranked in the top 45% of submissions. The workshop's acceptance rate was 70%. The acceptance rate of the main ICLR 2025 conference was 32%. The paper that “passed” reported a negative result, aligned with the workshop's focus on interesting negative findings.
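The numbers in this paragraph are easy to check and to put side by side. This is a minimal Python sketch, assuming acceptance simply follows score rank, which is a simplification. The helper `within_cutoff` is illustrative, and applying the workshop percentile to the main-conference rate is a thought experiment rather than something the paper reports, since the two submission pools differ.

```python
from statistics import mean

# Individual reviewer scores as quoted above (ICLR 1-10 scale).
reviewer_scores = [6, 7, 6]
average_score = mean(reviewer_scores)
print(f"mean reviewer score: {average_score:.2f}")

# Figures quoted above, written as fractions.
top_fraction = 0.45          # paper ranked in the top 45% of workshop submissions
workshop_acceptance = 0.70   # workshop acceptance rate
main_acceptance = 0.32       # ICLR 2025 main-conference acceptance rate


def within_cutoff(top_fraction: float, acceptance_rate: float) -> bool:
    """True if a paper in the top `top_fraction` falls inside the accepted share.

    Assumes acceptance is decided purely by score rank (a simplification).
    """
    return top_fraction <= acceptance_rate


print("inside the workshop cutoff:", within_cutoff(top_fraction, workshop_acceptance))
print("inside a 32% cutoff at the same rank:", within_cutoff(top_fraction, main_acceptance))
```

The second comparison is generous to the paper, because the main-conference pool is more competitive than the workshop pool. Even so, a top-45% rank would fall outside a 32% cutoff.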

Crucially: all three submissions were withdrawn per the pre-arranged ethics protocol with the workshop organisers and the University of British Columbia's Research Ethics Board. Nothing was actually published. The team also conducted their own internal review and concluded, in their own words, that one of the papers met the bar for workshop publication but none met the higher bar for a main ICLR conference paper.

That is the real claim, and it is a perfectly interesting one. A fully AI-generated machine-learning paper can now reach the lower bar of a top-venue workshop, although in this study only one of three submissions did so. The team's separate “Automated Reviewer” (an LLM judging machine-learning papers) shows balanced accuracy of 66–69% against historical accept/reject decisions, comparable to or modestly above human inter-reviewer consistency. The paper also documents a real correlation (R² = 0.517, P < 0.00001) between underlying model release date and generated-paper quality. This is sometimes loosely called a “scaling law of science”, though it is a correlation across a handful of models, not a physics-style law.
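The 66–69% balanced accuracy quoted above is easier to interpret once the metric is written out. This is a minimal Python sketch, assuming the standard definition (the mean of the recall on accepted papers and the recall on rejected papers). The confusion-matrix counts are hypothetical and chosen only to illustrate the metric; they are not taken from the paper.

```python
def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of the recall on accepted papers and the recall on rejected papers."""
    recall_accepted = tp / (tp + fn)   # accepted papers the reviewer also accepts
    recall_rejected = tn / (tn + fp)   # rejected papers the reviewer also rejects
    return (recall_accepted + recall_rejected) / 2


# Hypothetical pool (NOT from the paper): 100 accepted and 300 rejected papers.
# A reviewer that rejects everything has high plain accuracy on this pool...
plain_accuracy_always_reject = 300 / (100 + 300)
# ...but only chance-level balanced accuracy.
always_reject = balanced_accuracy(tp=0, fn=100, tn=300, fp=0)

# A hypothetical reviewer right on 70% of accepted and 65% of rejected papers.
example_reviewer = balanced_accuracy(tp=70, fn=30, tn=195, fp=105)

print(f"always-reject plain accuracy:    {plain_accuracy_always_reject:.2f}")
print(f"always-reject balanced accuracy: {always_reject:.2f}")
print(f"example balanced accuracy:       {example_reviewer:.3f}")
```

The practical reading is that a balanced accuracy in the high 60s should be compared with a 50% chance baseline, not with the raw accuracy a trivial always-reject reviewer would achieve on an imbalanced pool.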

📝 Where the rounding happens

Compare what the paper says with three things you will see in coverage of it:

  • “An AI wrote a paper that passed peer review.” True only at workshop level (70% acceptance), at a venue specifically designed for negative results, and the paper was withdrawn.
  • “The first fully AI-generated paper to pass a rigorous human peer-review process.” Strips out “workshop”, strips out “one of three”, and strips out “withdrawn per protocol”.
  • “AI Scientists are here.” The paper's own limitations section states that the system “cannot yet meet the standards of top-tier publications nor even do so consistently for workshops.” Common failure modes named by the authors include hallucinated citations, naive ideas, and lack of methodological rigour.

There is also an independent critical evaluation (arXiv:2502.14297, “Bold Claims, Mixed Results”) of the system, which is worth reading as a counterweight before forming your own opinion.

The discipline this case is meant to teach is not cynicism. It is the habit of pulling a primary source and reading its own discussion section before forming an opinion about what it means. The Sakana team behaved very honestly: they obtained ethics approval, they pre-committed to withdraw any accepted paper, they reported one acceptance and two rejections out of three submissions, and they wrote a careful limitations section. The rounding happened later, in transit, in summaries written by people who had not read the paper. For a postgraduate researcher in 2026, that “in transit” layer is where most of the AI-in-research claims you encounter will live. Read past it.

🚀 Aspirational: What Is Not Happening Yet

The aspirational bucket is the one most worth being explicit about, because confident claims about the near future are usually presented as if they were claims about the present. Three of the most prominent are worth naming and bounding.

End-to-end autonomous science

The picture in which an AI system formulates an original research question, designs experiments to answer it, runs them, interprets results, writes the paper, and submits it — with no human in the loop — is not happening, and is not on a credible near-term trajectory. Every system we have looked at above — Co-Scientist, Robin, GNoME/A-Lab, the AI Scientist — was steered by named human collaborators at every critical step. The Co-Scientist paper's own framing is “scientist-in-the-loop”.

Aspirational marker: every demonstration of “autonomy” turns out, on inspection, to have humans selecting research goals, prioritising hypotheses, and verifying outputs.

Recursive self-improvement

The idea that an AI research system substantially improves itself — designing better versions of its own architecture, then using those better versions to design better-still versions — remains a hypothetical capability. Real systems improve when humans deploy better models, write better harnesses, or curate better training data (the Week 10.1 lesson that the harness is the product). A coding agent helping its developers ship faster (as Claude Code reportedly does at Anthropic) is real and useful; it is not the same thing as recursive self-improvement of capability.

Aspirational marker: claims rely on extrapolating compute curves, not on any system that has demonstrably bootstrapped itself.

AI as primary author of high-impact papers

The journals you would want to publish in (Nature, Science, the medical journals, the African Journal of Marine Science) do not permit AI to be listed as an author, and have ethics policies under the International Committee of Medical Journal Editors and the Committee on Publication Ethics that make AI authorship a misconduct matter (we look at these in 11.3). Even the Sakana paper's headline AI-generated workshop manuscript was withdrawn rather than published. The notion of AI as first author is not a near-term operational reality.

Aspirational marker: the institutional rules have, if anything, moved in the opposite direction.

Listing the aspirational claims is not the same as dismissing them. Some of them may turn out to be true on a five-or-ten-year horizon. The point is the same as in Weeks 9 and 10: when you read a confident sentence about what AI “will” do for science, the useful question is not whether it is plausible but whether you would be willing to plan a research project on the assumption. If not, it belongs in the aspirational bucket, and you should make decisions about your own work as if the present, not the projected future, is what you have to work with.

🌍 What This Means for Your Research Trajectory

If you are starting a research project in 2026, you are scoping work that you will be defending at a viva in 2027 or 2028, and publishing from for some years after. The honest framing is that most of what this sub-lesson describes is happening now, or will land inside the next twelve months. Five years out is a different category — genuinely unknowable, and things could be radically different by 2030 in either direction. The relevant horizon for present decisions is the one immediately in front of you, not a hedge across half a decade. Three honest implications follow from that picture.

💡 The disposition this sub-lesson is asking for

You do not need a confident prediction about whether AI will replace researchers in 2030. You need a habit: when you read a claim about what AI is doing for science, ask which bucket it belongs in — real, overclaimed, or aspirational — and pull the primary source if you cannot tell.

If you carry only that into the rest of your research career, you will read AI-research news the way an experienced scientist reads any other field's news: looking for the limitations section, suspicious of press releases, willing to be impressed by carefully bounded results, and not willing to be hurried by aspirational ones.

✏️ A Short Exercise

Before the in-class session, do this exercise. It should take 30–45 minutes.

  1. Pick one of the four real-bucket cases above (AlphaFold 3, GNoME/A-Lab, AI Co-Scientist, or Robin/Kosmos) or one from your own field if you know it better.
  2. Find the primary source. Not a press release, not a blog post, not a news article — the peer-reviewed paper or the verified preprint. Look at the abstract, the limitations section, and the figure captions.
  3. Write two short paragraphs. In the first, describe what the paper actually claims, in terms a careful reader of your discipline would accept. In the second, describe what the AI specifically contributed and what humans still did. Be precise about the boundary.
  4. Find one piece of public coverage of the same work (newspaper, vendor blog, X/Twitter thread, anywhere). Note one specific thing it says that goes beyond what the paper itself supports. This is your “in transit” rounding.
  5. Bring all three pieces (your two paragraphs and your note on the public coverage) to class. We will pool them across the cohort.

📚 Sources & Further Reading

Primary sources used in this sub-lesson:

📄 The peer-reviewed papers

DeepMind AI Co-Scientist — Gottweis, J. et al. (2026). Accelerating scientific discovery with Co-Scientist. Nature. DOI 10.1038/s41586-026-10644-y. Accepted 11 May 2026; published online 19 May 2026.

Liver fibrosis follow-up — Guan, Y. et al. (2025). AI-assisted drug re-purposing for human liver fibrosis. Advanced Science, e08751.

cf-PICI mechanism — Penádés, J. R. et al. (2025). AI mirrors experimental science to uncover a novel mechanism of gene transfer crucial to bacterial evolution. Cell 188(23), 6654–6665.

Sakana AI Scientist — Lu, C., Lu, C., Lange, R. T., Yamada, Y. et al. (2026). Towards end-to-end automation of AI research. Nature 651, 914–919. DOI 10.1038/s41586-026-10265-5. CC BY 4.0.

Independent Sakana critique — “Bold Claims, Mixed Results”, arXiv:2502.14297 (February 2025).

AlphaFold 3 — Abramson, J. et al. (2024). Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630, 493–500. DOI 10.1038/s41586-024-07487-w.

GNoME & A-Lab — Merchant, A. et al. (2023). Scaling deep learning for materials discovery. Nature 624, 80–85. Szymanski, N. J. et al. (2023). An autonomous laboratory for the accelerated synthesis of novel materials. Nature 624, 86–91.

FutureHouse Robin: arXiv:2505.13400 (May 2025); see also FutureHouse's research announcement at futurehouse.org.

Mineault, P. (2026). Claude Code for Scientists. neuroai.science. Practitioner write-up of how agentic coding has changed day-to-day neuroscience research workflow.

Anthropic (March 2026). Coding Agents in the Social Sciences. Baseline wave of a randomised study, 1,260 quantitative social scientists surveyed. anthropic.com.

Mollick, E. Three essays on One Useful Thing tracking the shift from human-AI co-intelligence to humans managing autonomous agents: On Working with Wizards (11 September 2025); Claude Code and What Comes Next (8 January 2026); The Shape of the Thing (12 March 2026). Background book: Co-Intelligence: Living and Working with AI (Penguin, 2024).

Coming up in 11.2: a reading guide for the genuinely speculative end of the AI-in-research literature — frameworks (Krenn, Wang, Morris), falsifiable forecasts (METR), institutional visioning (Royal Society, Africa Declaration), and wild speculation (Clune, AI 2027, Russell) — with the same calibration habit applied throughout. 11.3 then turns from speculation to the institutional present: what journals and funders have actually done in response to AI in research, and the surprising gap between policy and practice.